Introduction

With future climate change, droughts in the Amazon forest may become more frequent and/or more severe. Droughts can turn Amazon regions from rain forest into savanna, releasing large amounts of carbon into the atmosphere. Predicting future droughts and understanding the underlying mechanisms is therefore of great interest. Ciemer et al. (2020) established an early warning indicator for droughts in the central Amazon basin (CAB), based on tropical Atlantic sea surface temperatures (SSTs). The aim of this thesis is to build on their work and to improve its predictive power using different statistical methods. We seek to build a model that predicts monthly precipitation from the sea surface temperatures. We also want to identify the sea regions that are most important for this prediction, making interpretability a point of interest as well. First we analyze the data descriptively to explore patterns and spatial dependencies; this includes a cluster analysis of the precipitation data in the central Amazon basin. We then compare two regression approaches with respect to their ability to predict precipitation and the interpretability of the SST regions they select. The first model is the lasso as proposed by R. Tibshirani (1996). Carrying the findings from the lasso forward, we compare different model specifications and fit a (sparse) fused lasso to the data (R. Tibshirani et al. (2005)). Both models are evaluated using a 5-fold forward-validation scheme, a model evaluation technique that accounts for the time dependencies present in the data at hand. We conclude with a summary of the findings in this work, an overview of the strengths and limitations of the approaches used, and ideas for future research.

This thesis was written and supervised in cooperation with Dr. Niklas Boers from the Potsdam Institute for Climate Impact Research (PIK e. V. (2021)) and Dr. Fabian Scheipl (LMU).

2 EDA

In this chapter we will inspect the values of the precipitation and SST data for the common observation period from 1981 until 2016.

2.1 EDA precipitation

In this section we study the time series of precipitation in the Central Amazon Basin. The precipitation data come from the CHIRPS data set, created from in-situ and satellite measurements (Funk et al. (2015)). It can be downloaded, for example, from [https://www.chc.ucsb.edu/data/chirps]. It contains observations from 1981 to 2016 on a high-resolution 0.05° grid, which we aggregate to a 0.5° grid.

Below we show the area of the Central Amazon Basin that is the object of our study.


Figure 2.1: Location of the area under study. The Central Amazon Basin (CAB), spanning 0° to -10° latitude and -70° to -55° longitude.

We will now inspect the precipitation values from three perspectives: first the raw values without time or spatial dependency, second the mean and standard deviation of each spatial grid cell over the whole time series, and third the mean and standard deviation of each grid cell for each month of the year separately.

First we plot the distribution of the raw precipitation values.

Density of the raw precipitation values without time or spatial dependency.

The density is uni-modal and right-skewed. The values range from 0 up to 769, but only few observations take such high values, forming a long right tail. This might indicate large outliers in the data, or be due to some locations with generally very high precipitation values.


Figure 2.2: Mean and standard deviation at each location. The standard deviation was computed over the whole time period. The white line on the scale at the side of the plots indicates the mean of the respective quantity

As we can see, most locations have a mean precipitation of around 200 mm/month over the whole time series. In the “upper left” corner of the Amazon Basin, mean precipitation is above the mean of the location means, which serves as the reference point for “higher”. This region seems to be more or less spatially consistent. The rest of the region, with lower mean precipitation, also has some small areas where precipitation is again a little higher, for example in the upper right corner and at the bottom, right of the middle.

For the standard deviation we also see regional patterns. These patterns overlap with the regions of the mean, but their magnitude is flipped: in the upper left, where we observe larger mean values, we generally observe lower standard deviations, while in the lower and upper right corners we observe higher standard deviations.


Figure 2.3: Mean precipitation values at each location, shown for the different months of a year separately. The white line on the scale at the side of the plots indicates the mean of the respective quantity

We see spatial patterns of the mean evolving over the year. For example, from May until August there is a spatial separation into two parts that dissolves in September. As expected, there is a large seasonal component in the means.

For the standard deviation we see as well large differences in values during different months of the year.

2.2 Glyph plots

This section provides a graphical presentation of the precipitation data known as glyph plots. The idea of glyph maps, its application and the general implementation used in this section are taken from Wickham et al. (2012). Glyph maps use a small icon or glyph to show multiple values at each location. In our case, we show a complete time series at each location instead of just single values. Different techniques can then be used to compare the time series between all locations, or their individual shape on a local scale. We will show seasonal data, de-seasonalized data on a common scale, and de-seasonalized data on a local scale. Seasonal time series are computed by averaging the values of each month at each location. Each seasonal time series therefore has only 12 values and can be plotted without smoothing. The de-seasonalized time series are computed by removing the seasonal effects from each time series over the complete observation period and therefore have to be smoothed to be visually inspectable. The de-seasonalized time series can then be compared across locations on a common or a local scale. On the common scale all values are displayed on the same axis range, while on the local scale the axes are changed so that their ranges refer to the range of values at the respective location. Rescaling is done as follows \[x_{rescaled} = \frac{x - \min(x)}{\max(x) - \min(x)}.\]
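The rescaling formula above can be sketched in a few lines (Python/NumPy illustration; by convention here, a constant series, whose range is zero, is mapped to zeros):

```python
import numpy as np

def rescale_local(ts):
    """Min-max rescale one time series to [0, 1], as in the formula above."""
    ts = np.asarray(ts, dtype=float)
    rng = ts.max() - ts.min()
    if rng == 0:
        return np.zeros_like(ts)  # constant series: no local variation
    return (ts - ts.min()) / rng

x = np.array([100.0, 150.0, 200.0])
scaled = rescale_local(x)  # minimum maps to 0, maximum to 1
```

Applied per location, this is the "local scale"; applying it with the global minimum and maximum instead would give the common scale.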

This helps us see the changes at each location relative to the range of values at that location. It also means that interpreting these plots requires care: in this form of display, large differences may correspond to only small changes in absolute values, appearing large only because of the small range of values at that location. To aid interpretation we use color shading to draw attention to areas where ranges are large, i.e. where larger differences in relative values also point to larger differences in absolute values (unscaled values, values on the global scale). Locations with large ranges are therefore shaded in lighter colors and those with smaller ranges in darker colors, making the lighter shaded areas more easily visible.

To improve readability of glyph maps, one can also add boxes around each glyph as well as reference lines for global means. The trajectory of each glyph can then be compared to the overall mean directly.


Figure 2.4: Glyph map of seasonal precipitation patterns. Each location is represented by a time series. The time series are separated by boxes. The gray reference lines inside the boxes show the mid-range for easier comparison.

The above figure is a glyph map of seasonal precipitation patterns (averages for each month) in the Central Amazon Basin. The gray reference lines show the mid-range for easier comparison of the patterns. We see differences in the seasonal patterns across the map. In the upper left, for example, the seasonal patterns stay above the mid-range, while on the bottom left they clearly lie towards the low end of the range. Some areas also show multimodal patterns. The patterns differ in range and in the month of maximum and minimum precipitation.


Figure 2.5: Glyph map of de-seasonalized and smoothed precipitation. Each location is represented by a time series. The time series are separated by boxes. The gray reference lines inside the boxes show the mid-range for easier comparison. The time series are scaled globally; the same position inside the cells corresponds to the same value in all locations.

This plot shows the smoothed de-seasonalized monthly precipitation after global scaling. The same position within each cell corresponds to the same value in all locations. Some areas have an almost linear course, increasing, decreasing or constant, while others show a more “wiggly” course. As an overall pattern we see that the shapes have a spatial connection: similar patterns lie close together, at the same latitude. Also, the closer to the equator, the less precipitation.


Figure 2.6: Glyph map of de-seasonalized and smoothed precipitation. The time series are scaled locally, so the ranges are not the same in all cells. The different ranges are indicated by color shades, where lighter shading indicates a larger range and darker shading a smaller range.

Now we inspect the glyph map with de-seasonalized, locally scaled values. This form of scaling emphasizes the individual shapes. Because of the applied scaling, seemingly big patterns may just be tiny effects; therefore colors are added according to range. Areas with lighter color have larger ranges than darker areas. The areas with steep linear increases and decreases have smaller ranges than, for example, the areas below -2.5 latitude on the left.

The results of the precipitation glyphs indicate that the CAB might be separable into different regions. If we can find a way to quantify the differences between these regions and separate them into clusters, we could then apply our regression models to each of these clusters and potentially improve model accuracy in each region compared to the complete area on average. In a later section we will therefore discuss and apply clustering algorithms to the precipitation data. But for now we will have a look at the SST data.

2.3 EDA SST

We explore the sea surface temperature data set used in the paper by Ciemer et al. (Ciemer et al. (2020)). ERSST (Extended Reconstructed Sea Surface Temperature, Huang et al. (2017)) is a reanalysis of observed data from the International Comprehensive Ocean-Atmosphere Data Set (ICOADS), which contains observations from 1800 until 2016 made, for example, by ships and buoys. The data come on a 2°x2° grid; where data were missing, interpolation techniques were used (see the paper for reference). The file contains two variables measured across the same dimensions: the sea surface temperatures and the respective SST anomalies (with respect to the 1971-2000 monthly climatology). Here we analyze the raw SST values, since the anomalies are computed relative to a climatology that overlaps the time frame used in our analysis. Using the anomalies could therefore introduce information leakage, because future values would enter the fitting of the model during training.

Mean and SD on the spatial map.

Figure 2.7: Mean and SD on the spatial map.


3 Correlation analysis

3.1 Short Recap

We give a short overview of the correlation between monthly sea surface temperature and monthly mean precipitation in the Central Amazon Basin (CAB). First we analyze the original and then the deseasonalized data. SST and precipitation data were deseasonalized as follows: each time series was decomposed by the stl algorithm according to \[\text{Monthly Data} = \text{Seasonal} + \text{Trend} + \text{Remainder}.\] Afterwards only the trend and remainder components were kept and recombined into a new time series that is used as predictor (SST) and target (precipitation).
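As a simplified stand-in for the stl decomposition (which additionally estimates a smooth trend), subtracting a monthly climatology illustrates the idea of keeping everything except the seasonal component (Python/NumPy sketch, not the R stl implementation used in this thesis):

```python
import numpy as np

def deseasonalize(ts, period=12):
    """Remove the seasonal component from a monthly series by subtracting
    the average value of each calendar month. What remains corresponds to
    trend + remainder in the decomposition above (simplified: no smooth
    trend estimate as in stl)."""
    ts = np.asarray(ts, dtype=float)
    monthly_mean = np.array([ts[m::period].mean() for m in range(period)])
    seasonal = monthly_mean[np.arange(len(ts)) % period]
    return ts - seasonal

# a purely seasonal series (two identical years) is reduced to zeros
ts = np.tile(np.arange(12, dtype=float), 2)
residual = deseasonalize(ts)
```

For a purely seasonal input the result is identically zero; for real data the trend and remainder survive.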

In a next step we compute the correlation between each SST grid-point time series and the mean precipitation time series. Since our goal is to predict precipitation from the SST information, we are also interested in predicting future precipitation some months ahead. To examine this we also compute the correlations for different time lags. For example, given a time lag of 6 months, we would use January SST data to predict precipitation in July.

We consider time lags of 0, 3, 6 and 12 months, and show the density of the correlation values as well as their spatial distribution on a map. We also display the strongest positive and negative correlations based on the respective 2.5% and 97.5% quantiles; all correlations between these values are then set to 0.

The correlation measure we use is the Pearson correlation coefficient.
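The lag construction and the quantile thresholding described above can be sketched as follows (Python/NumPy on synthetic data; here `lag` means the SST at month t is paired with precipitation at month t + lag):

```python
import numpy as np

def lagged_corr(sst, precip, lag):
    """Pearson correlation between an SST series and precipitation
    `lag` months later."""
    if lag > 0:
        sst, precip = sst[:-lag], precip[lag:]
    return np.corrcoef(sst, precip)[0, 1]

def threshold_strongest(corrs, lower=0.025, upper=0.975):
    """Keep only the strongest 5 percent of correlations (outside the
    2.5 / 97.5 percent quantiles); everything in between is set to 0."""
    lo, hi = np.quantile(corrs, [lower, upper])
    return np.where((corrs <= lo) | (corrs >= hi), corrs, 0.0)

rng = np.random.default_rng(0)
sst = rng.normal(size=120)
precip = np.roll(sst, 6)            # precip follows sst with a 6-month delay
r6 = lagged_corr(sst, precip, lag=6)  # close to 1 for this construction
```

In the real analysis, `lagged_corr` is evaluated once per SST grid point, and the thresholding is applied to the resulting map of correlations.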

3.2 Correlation of Sea Surface Temperature and Precipitation

3.2.1 Original Data

In the following, for each timelag we show the density of the correlation values, their locations on the map, and the 5% strongest positive and negative correlations.

3.2.1.1 Timelag 0

Inspecting the density plot for timelag 0, we see two modes, one for negative correlations around -0.8 and one for positive correlations around 0.8. A small spike can also be seen for low negative correlations. If we plot these correlations on the respective grid points we see a clear north-south negative-positive distinction; the “border” is organized around the equator. The plot of the strongest 5% of correlations reveals areas with strong positive and negative correlations in the north and south respectively.

3.2.1.2 Timelag 3

The density of correlations for timelag 3 is left-skewed and has two modes, around 0 and -0.125 respectively. The correlation map shows that the high positive and negative correlations are closer to the equator here. Note that the legend of the correlation map is “shifted” here, because the maximal negative correlation has a higher absolute value than the maximal positive correlation. The strongest correlations also seem to be shifted towards the equator.

3.2.1.3 Timelag 6

The density plot for timelag 6 is quite similar to the one for timelag 0 but seems to be “flipped” around 0. Likewise, the correlation map now shows (high) negative correlations in the south and high positive correlations in the north.

3.2.1.4 Timelag 12

Given a timelag of one year, the distribution of correlations is again similar to the distribution for timelag 0. This also holds for the location of positive and negative correlations in general, as well as for the strongest 5% of correlations.

3.2.2 Deseasonalised Data

In the following, for each timelag we show the density of the correlation values, their locations on the map, and the 5% strongest positive and negative correlations.

3.2.2.1 Timelag 0

Inspecting the density plot for timelag 0, we see that after excluding seasonality from the time series we get a left-skewed distribution of correlations with a mode around 0. In general the correlation values are much lower than in the original data, with maxima around -0.4 and +0.25 respectively. Plotting these correlations on the grid shows that the clear north-south distinction seen before deseasonalizing the data no longer appears. The plot of the strongest 5% of correlations still reveals areas with the strongest positive and negative correlations, but as stated before the values are in general much lower.

3.2.2.2 Timelag 3

For the density at timelag 3 we get a similar picture as for timelag 0. The mode is a bit higher and the tails get a bit more mass. The correlation map also does not seem to change much. The strongest correlations appear to be shifted to the left.

3.2.2.3 Timelag 6

Since the distributions of correlation values are all unimodal, we do not observe the “flip” seen in the original data when comparing the densities at timelags 0 and 6. The mode again gets larger and the maximum positive and negative correlation values get smaller. The strongest negative correlations are shifted further to the left.

3.2.2.4 Timelag 12

Given a timelag of one year, the density’s peak is now around 6, compared to around 4 for timelag 0. Neither the positive nor the negative correlations exceed absolute values of 0.25. The formerly consistent regions of strong negative and positive correlations are now less organized, i.e. more scattered.

3.3 Summary

3.3.1 Original Data

We observe that the positive and negative correlations of SST and precipitation follow a spatial and temporal pattern: the locations of positive and negative correlation “wander” over the equator in opposite directions. The densities and correlation maps for timelags 0 and 6 appear quite similar but “flipped”, and those for timelags 0 and 12 appear similar again. The same pattern seems to hold for the strongest correlations.

3.3.2 Deseasonalised Data

Correlation values are in general much lower than in the original data and decrease with increasing timelag. We still observe temporal and regional patterns, although these dissolve somewhat for a timelag of 12 months.

4 Clustering

In this chapter we first summarize the main ideas of clustering and then apply it to the precipitation data. If not indicated otherwise, the information is taken from The Elements of Statistical Learning (Hastie et al. (2009)).

4.1 Main Idea Clustering

We can describe an object by a set of measurements or by its similarity to other objects. Using this similarity we can put a collection of objects into subgroups or clusters. The objects within a subgroup should then be more similar to one another than to objects of different subgroups; that is, we aim for homogeneity inside the clusters and heterogeneity between clusters. With the cluster analysis applied to the precipitation data we want to study whether there are distinct groups (regions) in the CAB, so that when we later apply the regression models we predict the precipitation for each group rather than for the whole region.

To explore the grouping in the data we need a measure of (dis)similarity. This measure is central and depends on subject-matter considerations. We construct the dissimilarities based on the measurements taken for each month. We interpret this as a multivariate analysis where each month is one variable. Given the area of the CAB (resolution 0.5°x0.5°), we have 612 cells and 432 months, resulting in a \(612 \times 432\) data matrix whose cells we want to group into homogeneous clusters.

4.2 Clustering Methods

4.2.1 K-means

In the following we briefly describe the K-means procedure. Beforehand we have to specify the number of clusters \(C\) we believe exist in our data. Then we randomly initialize \(C\) cluster centers in the feature space. Two steps are then repeated until convergence:

  1. For each center we identify the points that are closest to this center.
    These points now “belong” to that cluster.
  2. In each cluster we compute the mean of each variable and get a vector of means. This mean vector is the new center of the cluster.

As a measure of dissimilarity we use the Euclidean distance:

\[\begin{equation} d(x_i,x_{i´}) = \sum_{j=1}^p(x_{ij}-x_{i´j})^2=||x_i-x_{i´}||^2 \tag{4.1} \end{equation}\]

Meaning, for the points \(i\) and \(i´\) we compute the squared difference for each variable and sum them up. As stated above, we are searching for clusters that are compact, i.e. homogeneous. We achieve this by minimizing the scatter inside the clusters, which we summarize as the within-cluster sum of squares

\[\begin{equation} W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i´)=k} ||x_i-x_{i´}||^2 = \sum_{k=1}^K N_k \sum_{C(i)=k} ||x_i-\bar{x}_k ||^2 \tag{4.2} \end{equation}\]

where \(\bar{x}_k = (\bar{x}_{1k},...,\bar{x}_{pk})\) is the mean vector of the \(k\)th cluster and \(N_k = \sum_{i=1}^N I(C(i)=k)\).
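The two alternating steps above, combined with the Euclidean distance (4.1), can be sketched in a few lines. This is a minimal illustration in Python/NumPy on toy data, not the implementation used in this thesis:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means (Lloyd's algorithm): alternate the two steps
    described above until the cluster centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # step 1: assign every point to its closest center (Euclidean)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 2: move each center to the mean of its assigned points
        new_centers = centers.copy()
        for c in range(k):
            if (labels == c).any():  # keep the old center if a cluster empties
                new_centers[c] = X[labels == c].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# two well-separated point clouds are recovered as two clusters
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
labels, centers = kmeans(X, k=2)
```

In our application, each row of `X` would be one grid cell and each column one month.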

4.2.2 K-means characteristics


Since we use the Euclidean distance, the similarity measure is sensitive to outliers and scale. K-means assumes that the variance of each variable's distribution is spherical, meaning it might not work well in situations that violate this assumption (e.g. non-spherical data). Further assumptions are equal variance across variables and equally sized clusters. How “large” these violations have to be before k-means stops working well has no clear-cut answer.

4.2.3 K-medoids

We can adjust the k-means procedure to use distances other than the Euclidean distance. The only part of the k-means algorithm that relies on the Euclidean distance is the computation of the cluster centers. We can replace this step by an optimization with respect to the cluster members, for example by requiring each center to be one of the observations assigned to the cluster. K-medoids is far more computationally intensive than k-means.

4.2.3.1 K-medoids characteristics

K-medoids is less sensitive to outliers, because it minimizes the sum of pairwise dissimilarities instead of the sum of squared Euclidean distances. As stated above, it is also more computationally intensive.
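A minimal k-medoids sketch on a precomputed dissimilarity matrix illustrates the key difference from k-means: each center must be one of the observations (Python/NumPy; a simplified alternating variant, not the full PAM algorithm):

```python
import numpy as np

def kmedoids(D, k, n_iter=100):
    """Minimal k-medoids on a precomputed dissimilarity matrix D.
    Like k-means, but each cluster center is the observation (medoid)
    that minimizes the summed dissimilarity within its cluster."""
    medoids = list(range(k))  # naive initialization: first k observations
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = []
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                new_medoids.append(medoids[c])
                continue
            # medoid = the member with smallest total dissimilarity
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids.append(int(members[within.argmin()]))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids

X = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(X[:, None] - X[None, :])  # pairwise L1 dissimilarities
labels, medoids = kmedoids(D, k=2)
```

Because only `D` is used, any dissimilarity measure can be plugged in, which is exactly the flexibility described above.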

4.2.4 PCA

The goal is the reduction of a correlated and possibly large number of \(p\) variables to a few. We accomplish this by creating new variables that are linear combinations of the original ones. We call these new variables principal components. The new variables are uncorrelated and ordered according to the variance they explain; the first \(k < p\) principal components then contain the majority of the variance (Fahrmeir et al. (1996)). Being ordered, they also provide a sequence of best linear approximations of our data. Let \(x_1, x_2,...,x_N\) be our observations, which we represent by a rank-\(q\) linear model

\[\begin{equation} f(\lambda) = \mu + V_q\lambda \tag{4.3} \end{equation}\]

with \(\mu\) a location vector in \(\mathbb{R}^p\), \(V_q\) a \(p \times q\) matrix whose columns are orthogonal unit vectors, and \(\lambda\) a \(q\)-vector of parameters. In other words, we are trying to fit a hyperplane of rank \(q\) to the data. Fitting this model by least squares to minimize the reconstruction error, we solve

\[\begin{equation} \min_{\mu, \{\lambda_i\}, V_q } \sum_{i=1}^N||x_i - \mu - V_q\lambda_i ||^2 \tag{4.4} \end{equation}\]

If we partially optimize for \(\mu\) and \(\lambda_i\) we obtain

\[ \hat{\mu} = \bar{x}, \qquad \hat{\lambda}_i = V_q^T(x_i - \bar{x}). \] Therefore we need to search for the orthogonal matrix \(V_q\) solving

\[\begin{equation} \min_{V_q} \sum_{i=1}^N ||(x_i-\bar{x}) - V_qV_q^t(x_i-\bar{x})||^2 \tag{4.5} \end{equation}\]

We can assume here that \(\bar{x} = 0\); if this is not the case we simply center the observations, \(\tilde{x}_i = x_i - \bar{x}\). \(H_q = V_qV_q^T\) projects each observation \(x_i\) from the original feature space onto the subspace spanned by the columns of \(V_q\); \(H_qx_i\) is then the orthogonal projection of \(x_i\). Hence \(H_q\) is also called the projection matrix.

We find \(V_q\) by constructing the singular value decomposition of our data matrix \(X\), which contains the (centered) observations in its rows, giving an \(N \times p\) matrix. The SVD is then:

\[\begin{equation} X = UDV^T \tag{4.6} \end{equation}\]

Here \(U\) is an orthogonal matrix containing the left singular vectors \(u_j\) as columns, and \(V\) contains the right singular vectors \(v_j\). The columns of \(U\) span the column space of \(X\) and the columns of \(V\) span the row space. \(D\) is a diagonal matrix containing the singular values \(d_1 \geq d_2 \geq ... \geq d_p \geq 0\), the square roots of the non-negative eigenvalues of \(X^TX\). The columns of \(UD\) are the principal components of \(X\). So the SVD gives us the matrix \(V\) (whose first \(q\) columns solve the minimization problem above) as well as the principal components from \(UD\) (Hastie et al. (2009)).
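In code, the PCA-via-SVD recipe from (4.6) looks as follows (Python/NumPy sketch on random data):

```python
import numpy as np

# PCA via the singular value decomposition: center X, take X = U D V^T,
# then the principal components (scores) are the columns of U D.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                 # center the columns

U, d, Vt = np.linalg.svd(X, full_matrices=False)
scores = U * d                          # principal components, N x p
var_explained = d**2 / (d**2).sum()     # ordered, largest first

# the first q columns of V (rows of Vt) solve the minimization problem;
# X is recovered exactly from scores @ Vt, and approximately from the
# first q components: X ~ scores[:, :q] @ Vt[:q, :]
```

The singular values returned by `np.linalg.svd` are already sorted in decreasing order, matching the ordering of the components by explained variance.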

4.2.5 Gap statistic

The idea of the gap statistic was introduced by R. Tibshirani, Walther, and Hastie (2001). As stated above, we usually measure how compact our clusters are by \(W(C)\) or \(log(W_c)\), where low values indicate compact clusters. To judge this value we need a reference: we estimate how large \(W_c\) would be if there were no clusters present in our data. The larger the difference between the \(W_c\) from the data and the one from the reference, the more confident we are that the found number of clusters is indeed correct. We construct reference data by sampling from a uniform distribution based on our data. Say we have \(p\) variables; we sample \(n\) times from each of \(p\) uniform distributions whose minima and maxima are obtained from our data. We then cluster the reference data in the same way we cluster our observed data and compute \(W_c\). We repeat this process several times and average \(log(W_c)\), giving \(E \{log(W_c)\}\). The gap statistic is then the difference

\[\begin{equation} E \{log(W_c) \} - log(W_c). \tag{4.7} \end{equation}\]

So when our data is formed of clusters we would expect a high gap statistic. As R. Tibshirani, Walther, and Hastie (2001) note, and as is also done in the R function clusGap that we use, performing a PCA on the data and computing the gap statistic on the PCA scores can improve the results of the gap statistic.
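The gap computation can be sketched as a self-contained toy version of what clusGap does (Python/NumPy, with a naive k-means inside; not the R implementation used in this thesis):

```python
import numpy as np

def within_ss(X, labels):
    """W(C): summed squared distances of points to their cluster mean."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

def kmeans_labels(X, k, seed=0, n_iter=50):
    """Bare-bones k-means, returning only the cluster assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def gap(X, k, n_ref=10, seed=0):
    """Gap statistic: E{log W_ref} - log W_obs, with reference data drawn
    uniformly inside the bounding box of X."""
    rng = np.random.default_rng(seed)
    log_w_obs = np.log(within_ss(X, kmeans_labels(X, k)))
    log_w_ref = []
    for _ in range(n_ref):
        ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
        log_w_ref.append(np.log(within_ss(ref, kmeans_labels(ref, k))))
    return np.mean(log_w_ref) - log_w_obs

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(5, 0.1, (30, 2))])
g2 = gap(X, k=2)  # two tight clusters: large positive gap at k = 2
```

For well-clustered data, the observed within-scatter is far smaller than that of the uniform reference, so the gap is large and positive.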

4.3 Analysis of the clustering results

We now analyze the results further: the clusters found after performing a PCA on the data, keeping the first 3 principal components and searching for 5 clusters using k-means. The first 3 principal components explain 67.8% of the variance. We found 5 to be the optimal number of clusters based on the gap statistic and the criterion from R. Tibshirani, Walther, and Hastie (2001).

We show the map plot for the k-means clustering after applying the PCA as well as the time series of the resulting clusters.


Figure 4.1: Spatial distribution of the found clusters in the CAB. We applied a centered PCA on the data and used 3 principal components before applying the K-means algorithm

The plot above shows the grid cells in the Central Amazon Basin colored according to the clusters assigned by k-means. The clusters are almost completely spatially coherent, meaning they are not scattered across different areas. One exception can be seen for clusters 1 and 4: parts of cluster 1 (orange) lie inside cluster 4 (blue) and on the edge to cluster 3 (green).


Figure 4.2: Plots of the time series in the clusters we found using the K-means algorithm. The x-axis shows the month of measurement, the y-axis the centered precipitation. Centering was done according to the overall CAB mean in each month. The mean inside each cluster for a month is displayed in blue.

We now inspect the original (centered) time series inside the clusters.
The time series are shown in gray and the monthly mean in the cluster is shown in blue. Since the time series are centered before applying the PCA and clustering, the zero value is the mean of the respective month of the whole CAB.
The clusters differ in their monthly deviations from the monthly CAB mean (here 0, because the time series were centered before PCA and K-means) and in their fluctuation/variance. Also, the clusters are not all the same size.
The mean in cluster 3 has lowest variability around the CAB mean, followed by clusters 2 and 5, and clusters 1 and 4 have the highest variability.
On average cluster 3 is on the level of the CAB overall mean. The clusters 2 and 5 are slightly below and cluster 4 is above the CAB mean, on average. Cluster 1 is on average also on the CAB mean but shows more variance than cluster 3.

5 LASSO Regression

5.1 Introduction

We want to create a model that has predictive and explanatory power. Predictive power means it can predict the precipitation in the Central Amazon Basin “reasonably well”. Explanatory power means it is interpretable, so that we can identify the regions in the sea that give us most information about future precipitation. Our problem setting is high dimensional with \(n \ll p\): the number of predictors is much larger than the number of observations. This creates issues for a classic linear model, since the linear problem is underdetermined. One possible model for the problem at hand is a LASSO regression model.

In general for the linear model:

\[\begin{equation} y_i = \beta_0 + \pmb{x}_i^T\pmb{\beta} + \epsilon_i \tag{5.1} \end{equation}\]

where \(y_i\) refers to the mean precipitation for a given month \(i\) and \(x_i\) is the vector of sea surface temperatures at different locations around the globe. \(\epsilon_i\) is the residual, and we wish to estimate the \(\beta\)´s from the data. As already stated, this is not possible with a classic linear model since the number of predictors exceeds the number of observations: we cannot estimate a \(\beta\) for every grid point in the sea. From a physical point of view it also seems reasonable that some regions in the ocean have higher predictive power than others. For example, regions closer to the Amazon may have more influence on precipitation in the same month, while regions farther away may carry more information about precipitation half a year into the future. We would therefore like a model that can find the most important regions in the sea for predicting precipitation at some point in the future. One possible solution is a LASSO regression model, as implemented in R by the glmnet package (Friedman, Hastie, and Tibshirani (2010)). This model “automatically” performs model selection; be aware, however, that because of the time dependencies in our data, standard cross-validation methods may be unjustified or at least have to be applied with caution. The glmnet package formulates the regression problem in the following manner, solving:

\[\begin{equation} \min_{\beta_0, \beta} \frac{1}{N} \sum_{i=1}^N w_il(y_i,\beta_0 + \beta^Tx_i) + \lambda [(1-\alpha)||\beta||_2^2/2 + \alpha||\beta||_1] \tag{5.2} \end{equation}\]

This is a lasso regression for \(\alpha = 1\) and a ridge regression for \(\alpha = 0\); \(\alpha\) controls the mixture of the two penalties, while \(\lambda\) controls the overall strength of the regularization. Intuitively, we try to find the \(\beta\)'s that minimize the negative log-likelihood of our data (which is equivalent to maximizing the log-likelihood), but at the same time we cannot include too many or too large \(\beta\)'s, since this would make the penalty terms in the formula grow. As a result, the algorithm keeps only the predictors with the most predictive power; how many predictors are included depends on the strength of regularization given by \(\lambda\). Remark: among strongly correlated predictors, the classical lasso tends to select only one and discard the others. Ridge regression shrinks all coefficients towards zero but sets none exactly to zero. The elastic net (e.g. \(\alpha=0.5\)) tends to include or drop groups of correlated predictors together. To explicitly select groups of predictors, variations of the lasso or other models have to be considered.

5.2 Implementation

The glmnet function finds a solution path for the lasso problem via coordinate descent; the implemented algorithm builds on a suggestion by Van der Kooij (2007). The optimization procedure can be written as follows: given \(N\) observation pairs \((x_i, y_i)\) with \(y_i \in \mathbb{R}\) and \(x_i \in \mathbb{R}^p\), we approximate the regression function by \(E(Y|X = x) = \beta_0 + x^T\beta\). The \(x_{ij}\) are considered standardized, so \(\sum_{i=1}^N x_{ij} = 0\) and \(\frac{1}{N}\sum_{i=1}^N x_{ij}^2 = 1\) for \(j = 1,...,p\). We then solve the problem:

\[\begin{equation} \min_{(\beta_0, \beta)\in\mathbb{R}^{p+1}} \left[ \frac{1}{2N} \sum_{i=1}^N (y_i - \beta_0 - x_i^T\beta)^2 + \lambda \left( (1-\alpha)||\beta||_2^2/2 + \alpha||\beta||_1 \right) \right] \tag{5.3} \end{equation}\]

Note that this is the elastic net problem, which also includes a ridge penalty. We follow the elastic net description, but in our case \(\alpha = 1\), so only the lasso penalty is active. Consider now a coordinate descent step for solving (5.3): suppose we have estimates \(\tilde{\beta}_0\) and \(\tilde{\beta}_\ell\) for all \(\ell \neq j\), and we want to partially optimize with respect to \(\beta_j\). If \(\tilde{\beta}_j > 0\), then

\[\begin{equation} \left.\frac{\partial R_{\lambda}}{\partial \beta_{j}}\right|_{\beta=\tilde{\beta}}=-\frac{1}{N} \sum_{i=1}^{N} x_{i j}\left(y_{i}-\tilde{\beta}_{0}-x_{i}^{\top} \tilde{\beta}\right)+\lambda(1-\alpha) \beta_{j}+\lambda \alpha . \tag{5.4} \end{equation}\]

A similar expression holds for \(\tilde{\beta}_{j}<0\); \(\tilde{\beta}_{j}=0\) is treated separately. The coordinate-wise update then has the form:

\[\begin{equation} \tilde{\beta}_{j} \leftarrow \frac{S\left(\frac{1}{N} \sum_{i=1}^{N} x_{i j}\left(y_{i}-\tilde{y}_{i}^{(j)}\right), \lambda \alpha\right)}{1+\lambda(1-\alpha)} . \tag{5.5} \end{equation}\]

with

  • \(\tilde{y}_{i}^{(j)}=\tilde{\beta}_{0}+\sum_{\ell \neq j} x_{i \ell} \tilde{\beta}_{\ell}\) being the fitted value without the contribution from \(x_{i j}\), so that \(y_{i}-\tilde{y}_{i}^{(j)}\) is the partial residual for fitting \(\beta_{j}\). Because the predictors are standardized, \(\frac{1}{N} \sum_{i=1}^{N} x_{i j}\left(y_{i}-\tilde{y}_{i}^{(j)}\right)\) is the simple least-squares coefficient obtained by fitting this partial residual to \(x_{i j}\).

  • \(S(z, \gamma)\) being the soft-thresholding operator. Its value is given by:

\[\begin{equation} \operatorname{sign}(z)(|z|-\gamma)_{+}= \begin{cases}z-\gamma & \text { if } z>0 \text { and } \gamma<|z| \\ z+\gamma & \text { if } z<0 \text { and } \gamma<|z| \\ 0 & \text { if } \gamma \geq|z| .\end{cases} \tag{5.6} \end{equation}\]
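Equation (5.6) translates directly into code; the following is a standalone Python sketch of the operator (not the glmnet internals):

```python
# Direct transcription of the soft-thresholding operator S(z, gamma)
# from equation (5.6): sign(z) * max(|z| - gamma, 0).

def soft_threshold(z, gamma):
    if z > 0 and gamma < abs(z):
        return z - gamma
    if z < 0 and gamma < abs(z):
        return z + gamma
    return 0.0

print(soft_threshold(3.0, 1.0))   # 2.0
print(soft_threshold(-3.0, 1.0))  # -2.0
print(soft_threshold(0.5, 1.0))   # 0.0: small coefficients are set exactly to zero
```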

So in summary the steps are as follows: compute the simple least-squares coefficient on the partial residual, then apply soft-thresholding (for the lasso penalty) and proportional shrinkage (for the ridge penalty). For our use case with \(\alpha = 1\), only soft-thresholding is applied and there is no proportional shrinkage.
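These steps can be sketched as a naive coordinate descent for the lasso case (\(\alpha = 1\)) in pure Python. This is a didactic sketch on standardized toy data (all names and numbers are ours), not the optimized glmnet implementation:

```python
# Naive coordinate-descent lasso (alpha = 1: soft-thresholding only),
# following update (5.5). Didactic sketch, not the glmnet code.

def soft(z, g):
    return (z - g) if z > g else (z + g) if z < -g else 0.0

def lasso_cd(X, y, lam, n_sweeps=100):
    """X: list of rows; columns standardized (mean 0, mean square 1)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    b0 = sum(y) / n  # intercept equals mean(y) for centered predictors
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual: leave out predictor j's contribution
            r = [y[i] - b0 - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            # simple least-squares coefficient on the partial residual
            z = sum(X[i][j] * r[i] for i in range(n)) / n
            beta[j] = soft(z, lam)  # update (5.5) with alpha = 1
    return b0, beta

# Toy data: two orthogonal standardized predictors; only the first drives y.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 2.0, -2.0]
b0, beta = lasso_cd(X, y, lam=0.1)
print(round(beta[0], 4), beta[1])  # 1.9 0.0 -- the noise predictor is dropped
```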

The solutions are computed along a decreasing sequence of \(\lambda\) values, starting from \(\lambda_{max}\), the smallest value for which all elements of \(\hat{\beta}\) are zero (for any larger \(\lambda\) the coefficients stay zero as well). The smallest value \(\lambda_{min}\) is set to \(\lambda_{min}=\epsilon \lambda_{max}\), and the full sequence is constructed as \(K\) values decreasing from \(\lambda_{max}\) to \(\lambda_{min}\) on the log scale; typical values are \(\epsilon = 0.001\) and \(K = 100\). Initializing each fit with the solution at the previous \(\lambda\) is an example of so-called warm starts. By default glmnet also centers and standardizes the predictor variables. For additional information on how further speedups are obtained, refer to Section 2 in Friedman, Hastie, and Tibshirani (2010).
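The construction of the \(\lambda\) sequence can be sketched as follows. For the lasso on centered, standardized data, \(\lambda_{max} = \max_j |x_j^T(y-\bar{y})|/N\); the function name and toy data below are ours, not glmnet's:

```python
# Sketch of a glmnet-style lambda grid: lambda_max is the smallest lambda
# at which all coefficients are zero, and K values are placed on a log
# scale down to eps * lambda_max (defaults eps = 0.001, K = 100).

import math

def lambda_grid(X, y, eps=0.001, K=100):
    n, p = len(X), len(X[0])
    ybar = sum(y) / n
    lam_max = max(abs(sum(X[i][j] * (y[i] - ybar) for i in range(n))) / n
                  for j in range(p))
    step = math.log(eps) / (K - 1)          # equal steps on the log scale
    return [lam_max * math.exp(step * k) for k in range(K)]

X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]
grid = lambda_grid(X, y)
print(grid[0])    # lambda_max = 2.0
print(grid[-1])   # approx 0.002 = eps * lambda_max
```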

5.4 Results

In the following we summarize the results of the different lasso approaches. For each approach we show the prediction errors on each test set during the forward validation as well as the actual predictions. The \(\lambda\) that achieved the minimum MSE during the forward validation is used to fit the model on the complete training data, which is then evaluated on the hold-out validation set at the end of the observation period. For the full model we show the predictions and the resulting coefficient map.
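How such forward (expanding-window) validation splits can be generated is sketched below in Python; the split rule and names are ours, not the thesis code. Each fold trains on all months before a cut point and tests on the following block, so test data always lie in the future of the training data:

```python
# Expanding-window ("forward") validation splits for a monthly time series.
# Sketch under our own conventions: n months are cut into k+1 equal blocks;
# fold f trains on the first f blocks and tests on block f+1.

def forward_splits(n, k=5):
    block = n // (k + 1)
    splits = []
    for fold in range(1, k + 1):
        train = list(range(0, fold * block))
        test = list(range(fold * block, (fold + 1) * block))
        splits.append((train, test))
    return splits

for train, test in forward_splits(120, k=5):
    # no information leaks backwards in time
    assert max(train) < min(test)
```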

5.4.1 lasso

Figure 5.1: MSE of the CV for the different lambda values on a log scale. The red dotted line shows the lambda for which the minimum MSE was obtained.

Figure 5.2: Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.3: Precipitation prediction and target values in the test set in each fold. Predictions in red and target values in black.

The MSE on the hold-out validation set is 1314.929.

5.4.2 standardized lasso

Figure 5.4: Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.5: MSE of the CV for the different lambda values on a log scale. The red dotted line shows the lambda for which the minimum MSE was obtained.

Figure 5.6: Precipitation prediction and target values in the test set in each fold. Predictions in red and target values in black.

The MSE on the hold-out validation set is 1214.489.

5.4.3 deseas lasso

Figure 5.7: MSE of the CV for the different lambda values on a log scale. The red dotted line shows the lambda for which the minimum MSE was obtained.

Figure 5.8: Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.9: Precipitation prediction and target values in the test set in each fold. Predictions in red and target values in black.

The MSE on the hold-out validation set is 1809.455.

5.4.4 diff1 lasso

Figure 5.10: MSE of the CV for the different lambda values on a log scale. The red dotted line shows the lambda for which the minimum MSE was obtained.

Figure 5.11: Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.12: Precipitation prediction and target values in the test set in each fold. Predictions in red and target values in black.

6 The fused lasso

6.1 Introduction

As expected, and as seen in the results, the different lasso models select individual SST grid points as predictors rather than whole regions. When predictors are highly correlated, the lasso tends to choose only one of the variables and discard the others. Moreover, the lasso only regularizes the magnitude of the coefficients and ignores their spatial ordering.
We therefore use the so-called fused lasso as implemented in the genlasso package and its fusedlasso function (R. Tibshirani et al. (2005), Taylor B. Arnold and Tibshirani (2020)). The fused lasso is a generalization of the lasso for problems whose features can be ordered in a meaningful way. It penalizes not only the \(L_1\)-norm of the coefficients but also the differences between coefficients that are adjacent in this ordering, introducing sparsity in both. In our case the fused lasso thus penalizes differences between the coefficients of SST grid points that are spatially close to each other.

The fused lasso, as implemented in the genlasso package, solves the problem:

\[\begin{equation} \min_{\beta} \frac{1}{2} \sum_{i=1}^n(y_i - x_i^T\beta)^2 + \lambda \sum_{(i,j) \in E} |\beta_i - \beta_j| + \gamma \cdot \lambda \sum_{i=1}^p|\beta_i|, \tag{6.1} \end{equation}\]

with \(x_i\) being the \(i\)th row of the predictor matrix and \(E\) the edge set of an underlying graph. Penalizing \(|\beta_i-\beta_j|\) discourages large differences between neighboring coefficients; in our case "neighboring" means small distances on the 2-dimensional longitude/latitude grid, which defines the graph connecting adjacent locations. The third term, \(\gamma \cdot \lambda \sum_{i=1}^p|\beta_i|\), controls the sparsity of the coefficients: \(\gamma=0\) leads to pure fusion of the coefficients (no sparsity), while \(\gamma > 0\) introduces sparsity into the solution, with higher values placing more weight on sparsity. \(\hat{\beta}\) is computed as a function of \(\lambda\), with \(\gamma\) fixed.

6.2 Implementation

This summary of the algorithm is taken from the paper proposing the implementation, Taylor B. Arnold and Tibshirani (2016), and the original paper introducing the algorithm, R. J. Tibshirani and Taylor (2011). In the fused lasso setting, the coefficients \(\beta \in \mathbb{R}^p\) can be thought of as nodes of a given undirected graph \(G\) with edge set \(E \subset \{1,...,p\}^2\). Assume that \(E\) has \(m\) edges, enumerated \(e_1,...,e_m\). The fused lasso penalty matrix \(D\) is then \(m \times p\), with each row corresponding to an edge in \(E\): when \(e_l = (i,j)\), the \(l\)th row of \(D\) is

\[\begin{equation} D_l = (0,...,-1,...,1,...) \in \mathbb{R}^p, \tag{6.2} \end{equation}\]

meaning \(D_l\) is zero everywhere except for a \(-1\) in the \(i\)th and a \(1\) in the \(j\)th position.
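The construction of \(D\) for a grid graph like our longitude/latitude grid can be sketched in a few lines of Python (toy grid dimensions; function names are ours):

```python
# Building the fused-lasso penalty matrix D of (6.2) for a small grid graph:
# nodes are grid cells, edges connect horizontal and vertical neighbours,
# and each edge contributes one row with a -1 and a +1.

def grid_edges(nrow, ncol):
    """Edges of a nrow x ncol 4-neighbour grid graph, nodes indexed row-major."""
    edges = []
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            if c + 1 < ncol:
                edges.append((i, i + 1))     # horizontal neighbour
            if r + 1 < nrow:
                edges.append((i, i + ncol))  # vertical neighbour
    return edges

def penalty_matrix(nrow, ncol):
    p = nrow * ncol
    D = []
    for i, j in grid_edges(nrow, ncol):
        row = [0] * p
        row[i], row[j] = -1, 1               # D_l as in (6.2)
        D.append(row)
    return D

D = penalty_matrix(2, 3)   # 2 x 3 grid: p = 6 nodes, m = 7 edges
print(len(D), len(D[0]))   # 7 6
```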

(6.1) is solved by the dual path algorithm of R. J. Tibshirani and Taylor (2011), for which Taylor B. Arnold and Tibshirani (2016) provide efficient implementations for different use cases of the (sparse) fused lasso. They describe the dual path algorithm in the notation of the generalized lasso problem:

\[\begin{equation} \hat{\beta}=\underset{\beta \in \mathbb{R}^{p}}{\arg \min } \frac{1}{2}\|y-X \beta\|_{2}^{2}+\lambda\|D \beta\|_{1}, \tag{6.3} \end{equation}\]

where \(y \in \mathbb{R}^n\) is the outcome vector, \(X \in \mathbb{R}^{n \times p}\) the predictor matrix, \(D \in \mathbb{R}^{m \times p}\) a penalty matrix, and \(\lambda \geq 0\) a regularization parameter. The algorithm solves not the primal but the dual of the problem, and computes the solution along a whole path of \(\lambda\) values instead of at single values; hence the "dual" and "path" in its name. Taylor B. Arnold and Tibshirani (2016) argue that the strength of the original algorithm lies in the fact that it applies to a unified framework in which \(D\) can be a general penalty matrix. Consider first the case \(X=I\) (the "signal approximator" case); the dual problem of (6.3) is then:

\[\begin{equation} \hat{u} \in \underset{u \in \mathbb{R}^{m}}{\arg \min } \frac{1}{2}\left\|y-D^{T} u\right\|_{2}^{2} \text { subject to }\|u\|_{\infty} \leq \lambda. \tag{6.4} \end{equation}\]

The primal and dual solutions \(\hat{\beta}\) and \(\hat{u}\) are related by:

\[\begin{equation} \hat{\beta}=y-D^{T} \hat{u} . \tag{6.5} \end{equation}\]

While the primal solution is unique, this need not be the case for the dual solution (note the element notation "\(\in\)" in (6.4)). The dual path algorithm starts at \(\lambda = \infty\) and computes the path down to \(\lambda = 0\). Conceptually, the algorithm keeps track of those coordinates of the dual solution \(\hat{u}(\lambda)\) that are equal to \(\pm\lambda\), i.e. that lie on the boundary of the region \([-\lambda,\lambda]\). Along the path it computes the critical values \(\lambda_1 \geq \lambda_2 \geq \ldots\) at which coordinates of the solution hit or leave this boundary.

The paper describes two algorithms, along with various specialized implementations that increase efficiency depending on the use case, i.e. on \(X\) and/or the special structure of \(D\). Algorithm 1 handles the \(X=I\) case and Algorithm 2 the general \(X\) case. The dual as introduced in (6.4) assumed \(X=I\), which is not satisfied in our case. For general \(X\), the problem can be rewritten so that only \(D\) and \(y\) change, to \(\tilde{D}=DX^+\) and \(\tilde{y}=XX^+y\), where \(X^+\) is the Moore-Penrose pseudoinverse of \(X \in \mathbb{R}^{n \times p}\); the same algorithm can then be applied. Algorithm 2 therefore transforms \(X\) and \(y\) in this way and applies Algorithm 1 to the transformed problem. Note that in our case \(p > n\), so \(X\) is column rank deficient. This is handled by adding a small fixed \(l_2\) penalty to the original problem, which leads to:

\[\begin{equation} \operatorname{minimize}_{\beta \in \mathbb{R}^{p}} \frac{1}{2}\|y-X \beta\|_{2}^{2}+\lambda\|D \beta\|_{1}+\varepsilon\|\beta\|_{2}^{2}, \tag{6.6} \end{equation}\]

and this is the same as

\[\begin{equation} \underset{\beta}{\operatorname{minimize}} \frac{1}{2}\left\|y^{*}-\left(X^{*}\right) \beta\right\|_{2}^{2}+\lambda\|D \beta\|_{1}, \tag{6.7} \end{equation}\]

with \(y^{*}=(y, 0)^{T}\) and \(X^{*}=\left[\begin{array}{c}X \\ \varepsilon \cdot I\end{array}\right]\). Because \(\operatorname{rank}\left(X^{*}\right)=p\), one of the algorithms can again be applied. Instead of solving a full linear system in each step, a QR decomposition is used that can be updated efficiently along the path; see the appendix of Taylor B. Arnold and Tibshirani (2016) for details. Special care has to be taken, though, for general \(X\) combined with certain structured \(D\): "blindly" applying the algorithms would lead to a large drop in relative efficiency, since \(\tilde{D}= DX^+\) destroys the special structure present in the penalty matrix. Specialized implementations are therefore provided, including one for our case of general \(X\) with \(D\) coming from the sparse fused lasso.

It can be shown that the computationally costly steps of the algorithm reduce to solving linear systems of the form \(DD^T x = Dc\). When \(D\) is the oriented incidence matrix of a graph, \(D\) is sparse, but \(DD^T\) can be rank deficient when the graph has more edges than nodes, i.e. \(m > p\). For sparse underdetermined systems it is easy to find an arbitrary solution (a so-called basic solution), but computing the solution with minimum \(l_2\) norm is in general much more difficult (Van Loan and Golub (1996)). The minimum \(l_2\) norm solution can, however, be derived from the basic solution. For penalty matrices that come from a graph, the structure of \(D\) can be exploited further: when \(D\) is the incidence matrix of a graph, \(D^TD\) is the Laplacian matrix of \(G\), and the resulting Laplacian linear systems are solved using a sparse Cholesky decomposition. For further details of the steps used in our case, refer to Sections 4 and 5 in Taylor B. Arnold and Tibshirani (2016).
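The identity between \(D^TD\) and the graph Laplacian is easy to verify on a toy graph in Python (a hand-picked triangle graph, not our SST grid):

```python
# Check on a toy graph: if D is the oriented incidence matrix, then
# D^T D equals the graph Laplacian (degree matrix minus adjacency matrix).

edges = [(0, 1), (1, 2), (0, 2)]    # a triangle on 3 nodes
p = 3
D = []
for i, j in edges:
    row = [0] * p
    row[i], row[j] = -1, 1          # one oriented row per edge
    D.append(row)

# D^T D
DtD = [[sum(D[l][i] * D[l][j] for l in range(len(D))) for j in range(p)]
       for i in range(p)]

# Laplacian: degree on the diagonal, -1 for each edge
L = [[0] * p for _ in range(p)]
for i, j in edges:
    L[i][i] += 1
    L[j][j] += 1
    L[i][j] -= 1
    L[j][i] -= 1

print(DtD == L)   # True
```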

7 References

Arnold, Taylor B., and Ryan J. Tibshirani. 2020. Genlasso: Path Algorithm for Generalized Lasso Problems. https://CRAN.R-project.org/package=genlasso.
Arnold, Taylor B., and Ryan J. Tibshirani. 2016. “Efficient Implementations of the Generalized Lasso Dual Path Algorithm.” Journal of Computational and Graphical Statistics 25 (1): 1–27.
Ciemer, Catrin, Lars Rehm, Juergen Kurths, Reik V Donner, Ricarda Winkelmann, and Niklas Boers. 2020. “An Early-Warning Indicator for Amazon Droughts Exclusively Based on Tropical Atlantic Sea Surface Temperatures.” Environmental Research Letters 15 (9): 094087.
Climate Impact Research (PIK) e. V., Potsdam Institute for. 2021. “Potsdam Institute for Climate Impact Research.” 2021. https://www.pik-potsdam.de/en.
Fahrmeir, Ludwig, Wolfgang Brachinger, Alfred Hamerle, and Gerhard Tutz. 1996. Multivariate Statistische Verfahren. Walter de Gruyter.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://doi.org/10.18637/jss.v033.i01.
Funk, Chris, Pete Peterson, Martin Landsfeld, Diego Pedreros, James Verdin, Shraddhanand Shukla, Gregory Husak, et al. 2015. “The Climate Hazards Infrared Precipitation with Stations-a New Environmental Record for Monitoring Extremes.” Scientific Data 2 (1): 1–21.
Hastie, Trevor, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Vol. 2. Springer.
Huang, Boyin, Peter W Thorne, Viva F Banzon, Tim Boyer, Gennady Chepurin, Jay H Lawrimore, Matthew J Menne, et al. 2017. “NOAA Extended Reconstructed Sea Surface Temperature (ERSST), Version 5.” NOAA National Centers for Environmental Information 30: 8179–8205.
Schulzweida, Uwe. 2019. “CDO User Guide (Version 1.9. 6).” Max Planck Institute for Meteorology: Hamburg, Germany.
Smith, Thomas M, Richard W Reynolds, Thomas C Peterson, and Jay Lawrimore. 2008. “Improvements to NOAA’s Historical Merged Land–Ocean Surface Temperature Analysis (1880–2006).” Journal of Climate 21 (10): 2283–96.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88.
Tibshirani, Robert, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. 2005. “Sparsity and Smoothness via the Fused Lasso.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (1): 91–108.
Tibshirani, Robert, Guenther Walther, and Trevor Hastie. 2001. “Estimating the Number of Clusters in a Data Set via the Gap Statistic.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2): 411–23.
Tibshirani, Ryan J, and Jonathan Taylor. 2011. “The Solution Path of the Generalized Lasso.” The Annals of Statistics 39 (3): 1335–71.
Van der Kooij, Anita J. 2007. Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. Leiden University.
Van Loan, Charles F., and Gene H. Golub. 1996. Matrix Computations. Johns Hopkins Studies in Mathematical Sciences. Johns Hopkins University Press.